Abstract: Nowadays, many decision support applications need to exploit data that are not only numerical or symbolic,
but also multimedia, multistructure, multisource, multimodal, and/or multiversion. We term such data complex
data. Managing and analyzing complex data raises many issues regarding their structure, storage
and processing, and metadata are a key element in all these processes. Such problems have been addressed
by classical data warehousing (i.e., applied to "simple" data). However, data warehousing approaches need
to be adapted for complex data. In this paper, we first propose a precise, though open, definition of complex
data. Then we present a general architecture framework for warehousing complex data. This architecture
relies heavily on metadata and domain-related knowledge, and rests on the XML language, which helps store
data, metadata and domain-specific knowledge together, and facilitates communication between the various
warehousing processes.



1 INTRODUCTION 



Data warehousing and OLAP (On-Line Analytical
Processing) technologies (Inmon, 2002; Kimball and
Ross, 2002) are now considered mature. They are
aimed, for instance, at analyzing the behavior of a
customer, product, or company, and may help monitor
one or several activities (commercial or medical
pursuits, patent deposits, etc.). More precisely,
they help analyze these activities in the form of
numerical data. However, in real life, many decision 
support fields (customer relationship management, 
marketing, competition monitoring, medicine...) need 
to exploit data that are not only numerical or sym- 



topic, which would be available under various formats 
(videos, images, sounds, texts, etc.). 

Complex data may or may not be structured, and are
often located in different, heterogeneous data
sources. Browsing these data requires an adapted
approach to help collect, integrate, structure and
eventually analyze them. A data warehousing solution is
attractive in this context, though adaptations are
obviously necessary to take data complexity into account.
Measures might not necessarily be numerical, for
instance. Data volume and dating are further
arguments in favor of the warehousing approach.
Furthermore, complex data produce different kinds of
information that are represented as metadata. These
metadata, along with domain-specific knowledge, are



for warehousing complex data (Section 3). This
model relies heavily on metadata and domain-specific
knowledge. It also rests on the XML language, which
we use for different purposes: to store complex data,
if necessary; to store metadata and knowledge about
these complex data; and to facilitate communication
between the different warehousing processes, namely ETL
(Extract, Transform, Load) and integration,
administration and monitoring, and analysis and usage. We
finally conclude this paper and provide research
perspectives (Section 4).



2 A DEFINITION OF COMPLEX DATA

Many researchers in several communities have started to
claim that they work on complex data. However, this
emerging concept of complex data varies a lot, even
within a single research community such as the
database community. Hence, as a first step, we
performed an extensive literature study to identify all
the different sorts of data researchers dealt with. We
particularly, but not exclusively, focused on publications
and events that explicitly mentioned the term
"complex data", which is emerging particularly in the data
mining field (Gançarski and Trousse, 2004). After
compiling all this information, we were able to propose
a first definition and concluded that data could
be qualified as complex if they were: (1) multiformat,
i.e., represented in various formats (databases, texts,
images, sounds, videos...); and/or (2) multistructure,
i.e., diversely structured (relational databases, XML
document repositories...); and/or (3) multisource, i.e.,
originating from several different sources (distributed
databases, the Web...); and/or (4) multimodal, i.e.,
described through several channels or points of view
(radiographies and audio diagnosis of a physician, data
expressed in different scales or languages...); and/or
(5) multiversion, i.e., changing in terms of definition
or value (temporal databases, periodical surveys...).
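
As an illustration only, the "and/or" structure of this first definition can be sketched as a set of combinable flags. The class and function names below are assumptions made for the example and do not come from the framework itself.

```python
from enum import Flag, auto


class Complexity(Flag):
    """The five criteria of the first definition (names are illustrative)."""
    NONE = 0
    MULTIFORMAT = auto()     # various formats: databases, texts, images...
    MULTISTRUCTURE = auto()  # diversely structured: relational, XML...
    MULTISOURCE = auto()     # originating from several different sources
    MULTIMODAL = auto()      # several channels or points of view
    MULTIVERSION = auto()    # definition or value changes over time


def is_complex(traits: Complexity) -> bool:
    """Data qualify as complex as soon as at least one criterion holds."""
    return traits != Complexity.NONE


# Hypothetical example: a patient record combining radiographs and audio reports.
patient_record = Complexity.MULTIFORMAT | Complexity.MULTIMODAL
print(is_complex(patient_record))
```

Criteria combine freely with `|`, which mirrors the fact that a single data set may satisfy several of them at once.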

However, it appeared in subsequent meetings with 
fellow researchers that this first definition was not 
sufficient to cover the wide variety of complex data. 

Figure 1: Axes of data complexity



3 COMPLEX DATA WAREHOUSE ARCHITECTURE FRAMEWORK

In contrast to classical solutions, complex data
warehouse architectures may be numerous and very
different from one another. However, two approaches
seem to emerge.

The first, main family of architectures is data-driven
and based on a classical, centralized data warehouse
where data are the main focus. XML document
warehouses (Xyleme, 2001; Baril and Bellahsene,
2003; Hummer et al., 2003; Nassis et al., 2004)
are examples of such solutions. They often
exploit XML views, which are XML documents
generated from whole XML documents and/or parts of
XML documents. A data cube is then a set of XML
views.
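
To make the notion of XML view more concrete, here is a minimal sketch using Python's standard `xml.etree.ElementTree` module. The source documents, the element names and the `build_view` helper are all invented for the example; the cited systems each define views in their own way.

```python
import xml.etree.ElementTree as ET

# Two hypothetical source documents (element and attribute names are invented).
doc_a = ET.fromstring(
    "<patients><patient id='1'><age>54</age><city>Lyon</city></patient></patients>"
)
doc_b = ET.fromstring(
    "<exams><exam patient='1'><type>radiography</type></exam></exams>"
)


def build_view(name, fragments):
    """Assemble an XML view from whole documents and/or parts of documents."""
    view = ET.Element("view", {"name": name})
    for fragment in fragments:
        view.append(fragment)
    return view


# One view keeps a whole document, the other keeps only selected parts.
view_exams = build_view("exams", [doc_b])
view_ages = build_view("ages", doc_a.findall("./patient/age"))

# Under this reading, a data cube is simply a collection of such views.
cube = [view_exams, view_ages]
for view in cube:
    print(ET.tostring(view, encoding="unicode"))
```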

The second family of architectures includes
solutions based on virtual warehousing, which are
process-driven and where metadata play a preeminent
role. These solutions are based on mediator-wrapper
approaches and exploit distributed data
sources. The schemas of these sources are among the main
pieces of information that mediators exploit to answer user queries.
Data are collected and multidimensionally modeled


recording, identification of a character in a video
sequence...). Processing the data thus amounts to
processing their descriptors. Original data are stored, for
instance as binary large objects (BLOBs), and can also
be exploited to extract information that could enrich
their own characteristics (descriptors and metadata).
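
The relationship between original data and descriptors can be sketched as follows. This is only an illustration under simple assumptions: the `ComplexObject` class, the `enrich` and `select` helpers and the sample descriptors are hypothetical, and the byte strings merely stand in for real files.

```python
from dataclasses import dataclass, field


@dataclass
class ComplexObject:
    """Original data kept as a BLOB, plus the descriptors that get processed."""
    blob: bytes
    descriptors: dict = field(default_factory=dict)


def enrich(obj: ComplexObject) -> None:
    """Derive an extra descriptor from the original data (here, only its size)."""
    obj.descriptors["size_bytes"] = len(obj.blob)


def select(objects, key, value):
    """Query descriptors only; the BLOB itself is never inspected."""
    return [o for o in objects if o.descriptors.get(key) == value]


# Hypothetical content: the byte strings stand in for real image or audio files.
scan = ComplexObject(b"\x89PNG...", {"format": "image", "subject": "chest"})
memo = ComplexObject(b"RIFF....", {"format": "sound", "subject": "diagnosis"})
enrich(scan)

images = select([scan, memo], "format", "image")
```

Queries touch the descriptors alone, while `enrich` shows how the stored original can still be revisited to add new descriptors later.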

The architecture framework we propose for complex
data warehousing (Figure 2) exploits the XML
language. Using XML indeed facilitates the integration
of heterogeneous data from various sources
into the warehouse; the exploitation of metadata
and knowledge (namely regarding the application domain)
within the warehouse; and data modeling and
storage. The presence of metadata and knowledge
in the data warehouse is aimed at improving global
performance, even if their actual integration is still
the subject of several research projects (McBrien and
Poulovassilis, 2001; Baril and Bellahsene, 2003; Shah
and Chirkova, 2003).

This architecture framework is essentially made of: 
the data warehouse kernel, which may be either ma- 
terialized as an XML warehouse, or virtual (where 
cubes are computed at run time); data sources; source 
type drivers that notably include mapping specifica- 
tions between the sources and XML; and a metadata 
and knowledge base layer that includes three submod- 
ules related to three management processes. 
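
The components listed above can be sketched in a few lines, under strong simplifying assumptions: the kernel is a materialized list of XML elements, a driver is a plain function mapping a raw source item to XML, and the metadata layer is reduced to a counter per source type. All names (`Warehouse`, `text_driver`, `load`) are hypothetical.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A source type driver maps one raw source item to an XML element.
Driver = Callable[[object], ET.Element]


def text_driver(raw: object) -> ET.Element:
    """Hypothetical driver for plain-text sources."""
    element = ET.Element("document", {"type": "text"})
    element.text = str(raw)
    return element


@dataclass
class Warehouse:
    drivers: Dict[str, Driver]                               # one driver per source type
    kernel: List[ET.Element] = field(default_factory=list)   # materialized XML store
    metadata: Dict[str, int] = field(default_factory=dict)   # stand-in for the metadata layer

    def load(self, source_type: str, raw: object) -> None:
        """Pick the driver for the source type, store the XML, update metadata."""
        self.kernel.append(self.drivers[source_type](raw))
        self.metadata[source_type] = self.metadata.get(source_type, 0) + 1


warehouse = Warehouse(drivers={"text": text_driver})
warehouse.load("text", "Patient reports chest pain.")
```

Supporting a new source type then only requires registering another driver; the kernel and the metadata layer stay unchanged.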

The three processes for managing a data warehouse
are: the ETL and integration process, which
feeds the warehouse with source data from operational
databases (DS Op) by using drivers that are
specific to each source type (ST); the administration
and monitoring process (MD&KB), which manages
metadata and knowledge (the administrator interacts
with the data warehouse through this process); and
the analysis and usage process, which runs user queries,
produces reports, builds data cubes, supports OLAP,
etc. Each of these processes exploits and updates the
metadata and the knowledge base. There are four
types of flows: the external flow, which includes the
ETL and integration flow and the exploitation (analysis
and usage) flow (the warehouse may thus be
considered as a black box); the internal flow, between the
warehouse kernel and the metadata and knowledge
base layer and between the metadata and knowledge



work, it helps us formalize the warehousing process
of complex data as a whole. Thus, we are able to identify
the issues to be solved. We can also point out the
great importance of metadata in managing and analyzing
complex data. Furthermore, piloting and synchronizing
the data warehouse processes we identify
in this framework is a research problem in itself.
Optimization techniques will be necessary to achieve
efficient management of data and metadata.
Communication techniques, presumably based on known
protocols, will also be needed to build efficient data
exchange solutions.



4 CONCLUSION AND PERSPECTIVES

In this paper, we addressed the problem of warehousing
complex data. We first clarified the concept
of complex data by providing a precise, though
open, definition. Then we presented a
general architecture framework for warehousing complex
data. It relies heavily on metadata and domain-specific
knowledge, which we identify as a key element
in complex data warehousing, and rests on the XML
language, which helps store data, metadata and
knowledge, and facilitates communication between
the various warehousing processes. This proposal
takes into account the two main possible families of
architectures for complex data warehousing (namely
virtual data warehousing and centralized, XML
warehousing). Finally, we briefly presented the main
issues in complex data warehousing, especially regarding
data integration, the modeling of complex data
cubes, and performance.

This study opens many research perspectives. Up
to now, our work has mainly focused on the integration
of complex data in an ODS. Though we also worked
on the multidimensional modeling of complex data,
this was our first significant advance into the actual
warehousing of complex data. In order to test and
refine our hypotheses in the field, we plan to apply
our proposals to three different application domains
we currently work on (medicine, banking and geogra- 

Figure 2: Complex data warehouse architecture framework



sent metadata are XML and RDF (Resource Description
Framework) schemas. These tools are appropriate
to represent both low-level and semantic
descriptors. Furthermore, they are suited to reasoning
for metadata exploitation. The Common Warehouse
Metamodel (CWM), an OMG standard for data
warehouses (OMG, 2003), could also help us manage
metadata and knowledge. But can the CWM metamodels
integrate the performance factors of a complex
data warehouse? Should these metamodels be
extended or would it be more interesting to propose
new submodels instead? These are largely open
questions.